Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search
Abstract
We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: $R = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$, $S_1, S_2, S_3 \in C$, where $C$ is a collection of $n$ sets (or binary vectors). We show that approximate R similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R resemblance over the pairwise case in improving retrieval quality.
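As a small illustration of the 3-way resemblance defined in the abstract, the sketch below computes R exactly and then estimates it by sampling random permutations. It relies on the standard min-wise hashing property that the minimum-ranked elements of all the sets coincide with probability equal to their resemblance. This is only a sanity check of the definition, not the search algorithm of the paper; the function names `kway_resemblance` and `estimate_resemblance`, the toy sets, and the parameters `universe_size`, `num_perms`, and `seed` are all assumptions made for the example.

```python
import random


def kway_resemblance(*sets):
    """Exact k-way Jaccard resemblance: |intersection| / |union|."""
    union = set().union(*sets)
    if not union:
        return 0.0
    inter = set(sets[0]).intersection(*sets[1:])
    return len(inter) / len(union)


def estimate_resemblance(sets, universe_size, num_perms=2000, seed=0):
    """Monte Carlo estimate of k-way resemblance via random permutations.

    Assumes every set is non-empty and holds integers in
    range(universe_size). For each sampled permutation, the minimum
    rank of every set is compared; all minima agree exactly when the
    lowest-ranked element of the union lies in the intersection.
    """
    rng = random.Random(seed)
    ranks = list(range(universe_size))
    hits = 0
    for _ in range(num_perms):
        rng.shuffle(ranks)  # ranks[x] is the rank of element x
        minima = {min(ranks[x] for x in s) for s in sets}
        if len(minima) == 1:
            hits += 1
    return hits / num_perms


# Toy example (hypothetical data): intersection {2, 3}, union {0, ..., 5}.
S1, S2, S3 = {0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}
exact = kway_resemblance(S1, S2, S3)
approx = estimate_resemblance([S1, S2, S3], universe_size=6)
print(f"exact R = {exact:.4f}, estimated R = {approx:.4f}")
```

For these toy sets the exact value is 2/6, and the estimate should approach it as `num_perms` grows. Explicit permutations are used here only because the universe is tiny; a practical system would use hash functions instead.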
Similar resources
Dual-tree fast exact max-kernel search
The problem of max-kernel search arises everywhere: given a query point $p_q$, a set of reference objects $S_r$, and some kernel $K$, find $\arg\max_{p_r \in S_r} K(p_q, p_r)$. Max-kernel search is ubiquitous and appears in countless domains of science, thanks to the wide applicability of kernels. A few domains include image matching, information retrieval, bio-informatics, similarity search, and collaborative fil...
Instance Similarity Deep Hashing for Multi-Label Image Retrieval
Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Recently, many deep hashing methods have been proposed and have shown greatly improved performance over traditional feature-learning-based methods. Most of these methods examine the pairwise similarity on the semantic-level labels, where the pairwise similarity is generally defined in a hard-a...
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we...
Efficient generation of super condensed neighborhoods
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Condensed neighborhoods, however, are not a minimal representation of a pattern neighborhood. Super condensed neighborhoods, proposed in this work, are smaller, provably minimal and can be used to locate approximate matches that can later be extended by on-line search. We...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...